5-B. Sentiment Analysis: Vader¶
# pip install tweepy==4.1.0
# pip install vaderSentiment
import os
import time
import math
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import multiprocessing
from pandarallel import pandarallel
import requests
import sys
import nltk
from textblob import TextBlob
from wordcloud import WordCloud
from google.cloud import storage
from textblob.sentiments import NaiveBayesAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pytz
import spacy
from collections import Counter
import concurrent.futures
import warnings
warnings.simplefilter('once')
warnings.simplefilter('ignore')
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
num_processors = multiprocessing.cpu_count()
num_processors
workers = num_processors-1
print(f'Using {workers} workers')
Using 15 workers
pandarallel.initialize(nb_workers=workers, use_memory_fs=False)
INFO: Pandarallel will run on 15 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
1. Import Data¶
%%time
file_path = 'news_cleaned.parquet'
news = pd.read_parquet(file_path)
CPU times: user 20.1 s, sys: 26.6 s, total: 46.7 s Wall time: 26.3 s
news = news.reset_index(drop = True)
news.shape # (198064, 16)
(198064, 16)
news.columns
Index(['url', 'date', 'language', 'title', 'text', 'year', 'month', 'day',
'text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned',
'title_lemm', 'title_word_count', 'text_word_count'],
dtype='object')
news.sample(1, random_state = 42)[['text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned', 'title_lemm']]
| | text_ner | text_cleaned | text_lemm | title_ner | title_cleaned | title_lemm |
|---|---|---|---|---|---|---|
| 196666 | Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images Skip to contentCommunity Coverage TourHome ProMedically SpeakingBest of the WestChampions in AgBack to Our AppsCOVID 19Food for NewsTexasNew to a TipLatest CamsClosings and DelaysSend Us Your Weather PhotosTxDOT Highway ConditionsDownload the Weather AppWeather ResourcesKCBD InvestigatesSubmit a TipChad Read ShootingReagor Dykes CoverageSex Trafficking on the South PlainsLubbock County Medical E... | prosecutors states urge congress strengthen tools fight ai child sexual abuse images skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend us weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dykes coveragesex trafficking south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell somethi... | prosecutor state urge congress strengthen tool fight ai child sexual abuse image skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend u weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dyke coveragesex traffic south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell something goodnot... | Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images | prosecutors states urge congress strengthen tools fight ai child sexual abuse images | prosecutor state urge congress strengthen tool fight ai child sexual abuse image |
2. Sentiment analysis with VADER: Positive, Neutral, Negative and Compound¶
Utilize a Sentiment Dictionary to decipher the sentiment of text¶
A sentiment dictionary is a mapping from words to sentiment values. For example, the word "awesome" (a positive sentiment) could have a value of +3.7 and the word "horrible" (a negative sentiment) could have a value of -3.1. To score a text, the values of its sentiment words are summed to get the overall sentiment.
For example: I loved the ambience of the restaurant but the drive to the restaurant was horrendous. Overall, it was a good evening.
Suppose the word "loved" has a value of +3.9, "horrendous" has a value of -4.2, and "good" has a value of +2.9. The values sum to +2.6, so the overall sentiment of the text is positive.
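The dictionary approach above can be sketched in a few lines. This is a toy illustration only: the lexicon holds just the three example values from the text (they are not entries from the real VADER lexicon), and `dictionary_score` is a hypothetical helper name.

```python
# Toy sentiment dictionary; the three valence values are the illustrative
# numbers from the restaurant example, not real VADER lexicon entries.
toy_lexicon = {"loved": 3.9, "horrendous": -4.2, "good": 2.9}


def dictionary_score(text, lexicon):
    """Sum the valence of every word in `text` that appears in `lexicon`."""
    # Very rough tokenization: lowercase, drop periods and commas, split on spaces.
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(lexicon.get(word, 0.0) for word in words)


review = ("I loved the ambience of the restaurant but the drive to the "
          "restaurant was horrendous. Overall, it was a good evening.")
total = dictionary_score(review, toy_lexicon)
print(f"aggregate score: {total:+.1f}")
print("overall sentiment:", "positive" if total > 0 else "not positive")
```

Words missing from the dictionary contribute nothing, so only the three sentiment-bearing words drive the result.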
VADER stands for Valence Aware Dictionary and sEntiment Reasoner. Its lexicon was tuned for social media text such as tweets, so it includes emoticons and slang. It also handles sentiment intensifiers (for example, "incredibly" in "incredibly funny") and negations (for example, "not bad", which scores as a slight positive).
How does it work? VADER checks each word of a text against its lexicon and produces four sentiment metrics from the matched word ratings: positive, neutral, negative, and compound. The positive, neutral, and negative scores are the proportions of the text that fall into each category. The compound score is the sum of the lexicon ratings, adjusted by VADER's heuristic rules and then normalized to a range between -1 and 1.
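To make the normalization step concrete, here is a minimal sketch of how an unbounded sum of ratings can be squeezed into the range -1 to 1. It assumes the form `x / sqrt(x^2 + alpha)` with `alpha = 15`, which to my knowledge matches the default in the vaderSentiment reference implementation; treat both the formula and the constant as assumptions and check the library source before relying on them.

```python
import math


def normalize(score_sum, alpha=15.0):
    """Map an unbounded valence sum into the open interval (-1, 1).

    Assumed form: x / sqrt(x^2 + alpha). The default alpha=15 is an
    assumption about the reference implementation, not taken from this notebook.
    """
    return score_sum / math.sqrt(score_sum * score_sum + alpha)


# Larger absolute sums are pushed ever closer to -1 or +1.
for s in (0.0, 2.6, 10.0, 50.0, -50.0):
    print(f"sum={s:6.1f} -> compound={normalize(s):+.4f}")
```

Under this form, long documents with many matched words accumulate large sums, which would help explain compound scores that cluster near the extremes for full news articles.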
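# Note: the vaderSentiment package imported above ships with its own lexicon file;
# this NLTK download is only required when using nltk.sentiment.vader instead.
import nltk
nltk.download('vader_lexicon')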
[nltk_data] Downloading package vader_lexicon to [nltk_data] /home/jupyter/nltk_data... [nltk_data] Package vader_lexicon is already up-to-date!
True
# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()
Which text to input for sentiment analysis?¶
For sentiment analysis of news articles about AI and job-market changes, three versions of the text are available: text minimally cleaned for NER (Named Entity Recognition), text moderately cleaned for LDA (Latent Dirichlet Allocation), and lemmatized text. Which one works best depends on the sentiment analysis tool and on what counts as a good result. The trade-offs are:
Minimally Cleaned Data (NER):
- Pros: Retains more context and original structure, which can be helpful for capturing sentiment related to specific entities and nuances in the text.
- Cons: May include noise that could potentially skew sentiment analysis results (e.g., irrelevant punctuation, capitalization, or rare words).
- Best for: Tools that are good at handling complex language nuances, including capitalization and entity-based sentiments (e.g., VADER, which understands capitalization as emphasis).
Moderately Cleaned Data (LDA):
- Pros: Removes stopwords and lowercases text, which could help in focusing on the more meaningful words that might carry sentiment.
- Cons: Some sentiment-bearing terms, especially intensifiers or negations, might be lost if not handled properly.
- Best for: Traditional sentiment analysis approaches that don’t handle entity recognition and rely more on the overall frequency and presence of sentiment-bearing words.
Lemmatized Text:
- Pros: Normalizes words to their base form, which can be beneficial for consistency and possibly reducing the feature space.
- Cons: Lemmatization might alter the meaning of some words, losing the sentiment in the process (e.g., changing "better" to "good" could affect the sentiment score).
- Best for: Sentiment analysis tools or models that are trained on lemmatized data or when consistency of word forms is crucial.
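The risk of losing negations during cleaning, listed among the cons above, can be shown with a toy scorer. Everything in this sketch is hypothetical: the mini-lexicon, the stopword list, and the sign-flip rule are simplifications for illustration, and VADER's real negation handling is more nuanced than a plain flip.

```python
# Hypothetical mini-lexicon, stopword list, and negation set (illustration only).
TOY_LEXICON = {"good": 1.9, "bad": -2.5}
TOY_STOPWORDS = {"the", "was", "is", "not"}  # common stopword lists include "not"
NEGATIONS = {"not", "never", "no"}


def toy_score(tokens):
    """Sum valences, flipping the sign of a word directly preceded by a negation."""
    total = 0.0
    for i, tok in enumerate(tokens):
        valence = TOY_LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    return total


raw = "the service was not good".split()
cleaned = [t for t in raw if t not in TOY_STOPWORDS]

print("raw tokens:    ", raw, "->", toy_score(raw))
print("cleaned tokens:", cleaned, "->", toy_score(cleaned))
```

Once "not" is removed as a stopword, the scorer sees only "good" and the polarity of the sentence is reversed, which is why the choice of input column matters for a rule-based tool.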
len(news['text_cleaned'])
198064
%%time
# Apply VADER to each article; the result is a Series of score dictionaries (neg, neu, pos, compound)
sentiments = news['text_cleaned'].parallel_apply(lambda x: sid.polarity_scores(x))
CPU times: user 1.73 s, sys: 5.23 s, total: 6.96 s Wall time: 24min 10s
# Convert the result into a DataFrame
df_sentiments = pd.DataFrame(sentiments.tolist())
df_sentiments.isnull().sum()
neg         0
neu         0
pos         0
compound    0
dtype: int64
# Define a function to label the compound score as positive, negative, or neutral
def label_sentiment(row):
    if row['compound'] > 0:
        return 'positive'
    elif row['compound'] < 0:
        return 'negative'
    else:
        return 'neutral'
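The labeling rule above cuts at exactly zero, so only a compound score of 0 is labeled neutral. As a hedged variation, the sketch below adds a neutral band around zero. The function name `label_compound` and the `threshold` parameter are hypothetical. The VADER documentation suggests cutoffs of roughly ±0.05; this sketch uses strict inequalities so that `threshold=0.0` reproduces the notebook's rule.

```python
def label_compound(compound, threshold=0.0):
    """Label a compound score, treating |score| <= threshold as neutral.

    threshold=0.0 reproduces the zero cutoff used in this notebook;
    threshold=0.05 approximates the commonly suggested VADER cutoffs.
    """
    if compound > threshold:
        return 'positive'
    if compound < -threshold:
        return 'negative'
    return 'neutral'


# Compare the two cutoffs on a few example scores.
for score in (0.9988, 0.0015, 0.0, -0.0018, -0.9840):
    print(f"{score:+.4f}  zero cutoff: {label_compound(score):<8}  "
          f"0.05 band: {label_compound(score, threshold=0.05)}")
```

A wider neutral band mainly affects scores that sit very close to zero, so it changes few labels when most scores lie near the extremes.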
%%time
# Apply the function to determine Positive, Negative, or Neutral
df_sentiments['sentiment'] = df_sentiments.parallel_apply(label_sentiment, axis=1)
CPU times: user 49.1 ms, sys: 2.1 s, total: 2.15 s Wall time: 2.37 s
# Keep only the 'sentiment' and 'compound' columns
df_results = df_sentiments[['sentiment', 'compound']]
# Add the results back to the original DataFrame
news['vader_sent'] = df_results['sentiment']
news['vader_comp'] = df_results['compound']
news[news['vader_sent'] == 'positive'][['text_ner', 'vader_sent', 'vader_comp']].sample(3, random_state = 42)
| | text_ner | vader_sent | vader_comp |
|---|---|---|---|
| 29974 | OpenAI has its tentacles in hundreds of companies. Here s how it s making them more productive. HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SEARCH SHOPPING YAHOO PLUS MORE ... Yahoo Finance Yahoo Finance Sign in Mail Sign in to view your mail Finance Watchlists My Portfolio Crypto Yahoo Finance Plus Dashboard Research Reports Investment Ideas Community Insights Webinars Blog News Latest News Yahoo Finance Originals Stock Market News Earnings Politics Economic News Morning Brief Personal... | positive | 0.9988 |
| 124108 | Bright Direction Dental Selects Overjet AI to Elevate Patient Care Skip to Florida WeekendWatch LiveWatch GuideSouth Florida WeekendSportsAbout UsContact UsNextGen TVProgramming ScheduleLatest Country Music LifestyleGray DC BureauInvestigate TVPress ReleasesBright Direction Dental Selects Overjet AI to Elevate Patient CarePublished Jan., at AM EST Updated hours agoThe DSO embraced technological innovation and partnered with Overjet for AI powered radiograph analysis, clinical insights, and o... | positive | 0.9989 |
| 36914 | Ai Regulation Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity X We use cookies to ensure best experience for you We use cookies and other tracking technologies to improve your browsing experience on our site, show personalize content and targeted ads, analyze site traffic, and understand where our audience is coming from. You can also read our privacy policy, We use cookies to ensure the best experience for you on our website. By choosing I accept, or by c... | positive | 0.9976 |
news[news['vader_sent'] == 'negative'][['text_ner', 'vader_sent', 'vader_comp']].sample(3, random_state = 42)
| | text_ner | vader_sent | vader_comp |
|---|---|---|---|
| 72547 | Ex Florida data scientist turns herself in after arrest warrant issued Skip to content Go Local Grow with Us Expert Connections Health Connections Contests Moms Talk Baby Boomers Talk Panhandle Deals Viewers Choice Awards Home News WATCH LIVE Weather Closings Coronavirus Vaccine Watch Community Sports About Us Home Election Results Download our Apps WATCH LIVE Go Local News National Crime Education Perspective with Brent McClure Good News With Doppler Dave Coronavirus Vaccine Watch Panhandle... | negative | -0.9840 |
| 105006 | Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Skip to main content MySA Homepage Currently Reading Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Newsletters Sign In HomeSubscribeBuy E N MerchandiseContact UsAbout UsAdvertise With UsPlace a Classified AdPrivacy NoticeNewsletters Text AlertsFind a Business in S.A.Manage by to San AntonioClassified Ma... | negative | -0.1725 |
| 107850 | Musk, scientists call for halt to AI race sparked by ChatGPT Skip to contentTornado Disaster InfoWhat s Your Home ShowWeatherWeather MapsRadarWeather BlogWeather AcademyWeather RadioSevere Weather ScoresBeat the AceTeam of the WeekAaron s AcesCheerleader ChallengeCommunity MapGood Morning ArkLaMissGuest RecipesGuest Interview Request FormHealth ConnectionsPerfect HomeOur TownService SaluteSubmit Photos and VideosFeed Your SoulRecommend Your Favorite RestaurantMr. FoodTalking FoodTV ListingsS... | negative | -0.2247 |
news.isnull().sum()
url                 0
date                0
language            0
title               0
text                0
year                0
month               0
day                 0
text_ner            0
text_cleaned        0
text_lemm           0
title_ner           0
title_cleaned       0
title_lemm          0
title_word_count    0
text_word_count     0
vader_sent          0
vader_comp          0
dtype: int64
news.to_parquet('news_vader_sent.parquet')
# Google Cloud Storage details
bucket_name = 'nlp-final'
file_path = 'news_vader_sent.parquet' # This is the name the file will have in GCS
local_file_path = 'news_vader_sent.parquet' # Path to the local file you just saved
# Create a GCS Client
storage_client = storage.Client()
# Get the bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob(file_path)
# Upload the file
blob.upload_from_filename(local_file_path)
3-(A). Sentiment over time: Compound Score¶
3.1. Overall Sentiment (Average of Sentiment from Positive and Negative)¶
1. Sentiment Distribution¶
sentiment_counts = news['vader_sent'].value_counts(ascending=False).reset_index()
sentiment_counts.columns = ['Sentiment', 'Count']
sentiment_counts
| | Sentiment | Count |
|---|---|---|
| 0 | positive | 187561 |
| 1 | negative | 9947 |
| 2 | neutral | 556 |
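The raw counts in the table are easier to compare as shares of the corpus. The sketch below simply re-enters the three counts shown above by hand (in the notebook they would come from `sentiment_counts`) and converts them to percentages.

```python
# Counts copied from the sentiment_counts table above.
counts = {"positive": 187561, "negative": 9947, "neutral": 556}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(f"total articles: {total}")
for label, share in shares.items():
    print(f"{label:<8} {share:7.2%}")
```

The split is heavily skewed toward the positive label, which is worth keeping in mind when reading the averages in the sections that follow.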
# Create a bar plot
plt.figure(figsize=(7, 5))
sns.barplot(x='Sentiment', y='Count', data=sentiment_counts)
# Adding title and labels
plt.title('Sentiment Distribution from VADER Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Count')
# Show the plot
plt.show()
# Compute the absolute values of the vader_comp scores
abs_vader_comp = abs(news['vader_comp'])
abs_vader_comp.describe()
count    198064.000000
mean          0.971577
std           0.112334
min           0.000000
25%           0.993000
50%           0.997700
75%           0.999100
max           1.000000
Name: vader_comp, dtype: float64
# Create a histogram with a KDE overlay (sns.distplot is deprecated; sns.histplot replaces it)
plt.figure(figsize=(7, 5)) # Set the size of the plot
sns.histplot(abs_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')
# Customize the plot
plt.title('Distribution of Absolute VADER Compound Scores')
plt.xlabel('Absolute Compound Score')
plt.ylabel('Density')
plt.show()
# Summarize the raw (signed) vader_comp scores
vader_comp = news['vader_comp']
vader_comp.describe()
count    198064.000000
mean          0.887799
std           0.410358
min          -1.000000
25%           0.992500
50%           0.997700
75%           0.999000
max           1.000000
Name: vader_comp, dtype: float64
# Create a histogram with a KDE overlay (sns.distplot is deprecated; sns.histplot replaces it)
plt.figure(figsize=(7, 5)) # Set the size of the plot
sns.histplot(vader_comp, bins=30, kde=True, stat='density', edgecolor='black')
# Customize the plot
plt.title('Distribution of VADER Compound Scores')
plt.xlabel('Compound Score')
plt.ylabel('Density')
plt.show()
2. Sentiment Over Time¶
Year¶
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
yearly_sentiment.head()
| | Year | Average_Sentiment |
|---|---|---|
| 0 | 2020 | 0.877180 |
| 1 | 2021 | 0.900908 |
| 2 | 2022 | 0.914087 |
| 3 | 2023 | 0.877717 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Trend', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year']) # Ensure all years are shown as x-ticks
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='year', y='vader_comp', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Over Time', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Month¶
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
monthly_sentiment.head()
| | Year | Month | Average_Sentiment |
|---|---|---|---|
| 0 | 2020 | 1 | 0.894119 |
| 1 | 2020 | 2 | 0.888097 |
| 2 | 2020 | 3 | 0.925711 |
| 3 | 2020 | 4 | 0.940956 |
| 4 | 2020 | 5 | 0.923580 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))
# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)
# Add titles and labels
plt.title('Monthly Average Sentiment by Year', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']) # Month labels from 1 to 12
# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')
# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str).str.zfill(2) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
monthly_sentiment.head()
| | Year | Month | Average_Sentiment | Year_Month |
|---|---|---|---|---|
| 0 | 2020 | 1 | 0.894119 | 2020-01 |
| 1 | 2020 | 2 | 0.888097 | 2020-02 |
| 2 | 2020 | 3 | 0.925711 | 2020-03 |
| 3 | 2020 | 4 | 0.940956 | 2020-04 |
| 4 | 2020 | 5 | 0.923580 | 2020-05 |
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment Over Time', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='month', y='vader_comp', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Day¶
daily_sentiment = news.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment |
|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | 0.806833 |
| 1 | 2020 | 1 | 2 | 0.671939 |
| 2 | 2020 | 1 | 3 | 0.717555 |
| 3 | 2020 | 1 | 4 | 0.771158 |
| 4 | 2020 | 1 | 5 | 0.852698 |
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment | Month_Day |
|---|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | 0.806833 | 01-01 |
| 1 | 2020 | 1 | 2 | 0.671939 | 01-02 |
| 2 | 2020 | 1 | 3 | 0.717555 | 01-03 |
| 3 | 2020 | 1 | 4 | 0.771158 | 01-04 |
| 4 | 2020 | 1 | 5 | 0.852698 | 01-05 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Set the style to white (no grid)
sns.set(style="white")
# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)
# Customize the plot
plt.title('Daily Average Sentiment Trend By Year', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# Improve x-tick readability
# Show only the first day of each month or every few days
x_ticks = daily_sentiment['Month_Day'].unique()[::10] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')
# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
daily_sentiment2 = news.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
daily_sentiment2.head()
| | Date | Average_Sentiment |
|---|---|---|
| 0 | 2020-01-01 | 0.806833 |
| 1 | 2020-01-02 | 0.671939 |
| 2 | 2020-01-03 | 0.717555 |
| 3 | 2020-01-04 | 0.771158 |
| 4 | 2020-01-05 | 0.852698 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')
# Customize the plot
plt.title('Daily Average Sentiment Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='day', y='vader_comp', marker='o')
# Customize the plot
plt.title('Daily Average Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
3.2. Positive Sentiment (Average of Sentiment from Positive)¶
1. Sentiment Distribution¶
news_po = news[news['vader_sent'] == 'positive']
po_vader_comp = news_po['vader_comp']
po_vader_comp.describe()
count    187561.000000
mean          0.981748
std           0.079434
min           0.001500
25%           0.994200
50%           0.997900
75%           0.999100
max           1.000000
Name: vader_comp, dtype: float64
# Create a histogram with a KDE overlay (sns.distplot is deprecated; sns.histplot replaces it)
plt.figure(figsize=(7, 5)) # Set the size of the plot
sns.histplot(po_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')
# Customize the plot
plt.title('Distribution of VADER Compound Scores from Positive Sentiment')
plt.xlabel('Compound Score')
plt.ylabel('Density')
plt.show()
2. Sentiment Over Time¶
Year¶
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news_po.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
yearly_sentiment.head()
| | Year | Average_Sentiment |
|---|---|---|
| 0 | 2020 | 0.979042 |
| 1 | 2021 | 0.986038 |
| 2 | 2022 | 0.984295 |
| 3 | 2023 | 0.980291 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Trend from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year']) # Ensure all years are shown as x-ticks
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='year', y='vader_comp', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Month¶
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news_po.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
monthly_sentiment.head()
| | Year | Month | Average_Sentiment |
|---|---|---|---|
| 0 | 2020 | 1 | 0.981678 |
| 1 | 2020 | 2 | 0.977027 |
| 2 | 2020 | 3 | 0.985241 |
| 3 | 2020 | 4 | 0.986130 |
| 4 | 2020 | 5 | 0.986621 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))
# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)
# Add titles and labels
plt.title('Monthly Average Sentiment by Year from Positive Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']) # Month labels from 1 to 12
# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')
# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str).str.zfill(2) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
monthly_sentiment.head()
| | Year | Month | Average_Sentiment | Year_Month |
|---|---|---|---|---|
| 0 | 2020 | 1 | 0.981678 | 2020-01 |
| 1 | 2020 | 2 | 0.977027 | 2020-02 |
| 2 | 2020 | 3 | 0.985241 | 2020-03 |
| 3 | 2020 | 4 | 0.986130 | 2020-04 |
| 4 | 2020 | 5 | 0.986621 | 2020-05 |
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='month', y='vader_comp', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment from Positive Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Day¶
daily_sentiment = news_po.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment |
|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | 0.970577 |
| 1 | 2020 | 1 | 2 | 0.945592 |
| 2 | 2020 | 1 | 3 | 0.980421 |
| 3 | 2020 | 1 | 4 | 0.979463 |
| 4 | 2020 | 1 | 5 | 0.973589 |
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment | Month_Day |
|---|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | 0.970577 | 01-01 |
| 1 | 2020 | 1 | 2 | 0.945592 | 01-02 |
| 2 | 2020 | 1 | 3 | 0.980421 | 01-03 |
| 3 | 2020 | 1 | 4 | 0.979463 | 01-04 |
| 4 | 2020 | 1 | 5 | 0.973589 | 01-05 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Set the style to white (no grid)
sns.set(style="white")
# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)
# Customize the plot
plt.title('Daily Average Sentiment Trend by Year from Positive Sentiment', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# Improve x-tick readability
# Show only the first day of each month or every few days
x_ticks = daily_sentiment['Month_Day'].unique()[::10] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')
# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
daily_sentiment2 = news_po.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
daily_sentiment2.head()
| | Date | Average_Sentiment |
|---|---|---|
| 0 | 2020-01-01 | 0.970577 |
| 1 | 2020-01-02 | 0.945592 |
| 2 | 2020-01-03 | 0.980421 |
| 3 | 2020-01-04 | 0.979463 |
| 4 | 2020-01-05 | 0.973589 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')
# Customize the plot
plt.title('Daily Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='day', y='vader_comp', marker='o')
# Customize the plot
plt.title('Daily Average Sentiment from Positive Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
3.3. Negative Sentiment (Average of Sentiment from Negative)¶
1. Sentiment Distribution¶
news_ne = news[news['vader_sent'] == 'negative']
ne_vader_comp = news_ne['vader_comp']
ne_vader_comp.describe()
count    9947.000000
mean       -0.834087
std         0.242236
min        -1.000000
25%        -0.989600
50%        -0.955200
75%        -0.788400
max        -0.001800
Name: vader_comp, dtype: float64
# Create a histogram with a KDE overlay (sns.distplot is deprecated; sns.histplot replaces it)
plt.figure(figsize=(7, 5)) # Set the size of the plot
sns.histplot(ne_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')
# Customize the plot
plt.title('Distribution of VADER Compound Scores from Negative Sentiment')
plt.xlabel('Compound Score')
plt.ylabel('Density')
plt.show()
2. Sentiment Over Time¶
Year¶
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news_ne.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
yearly_sentiment.head()
| | Year | Average_Sentiment |
|---|---|---|
| 0 | 2020 | -0.819097 |
| 1 | 2021 | -0.849497 |
| 2 | 2022 | -0.848392 |
| 3 | 2023 | -0.830405 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Trend from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year']) # Ensure all years are shown as x-ticks
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_ne, x='year', y='vader_comp', marker='o')
# Customize the plot
plt.title('Yearly Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Month¶
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news_ne.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
monthly_sentiment.head()
| | Year | Month | Average_Sentiment |
|---|---|---|---|
| 0 | 2020 | 1 | -0.849019 |
| 1 | 2020 | 2 | -0.802590 |
| 2 | 2020 | 3 | -0.739436 |
| 3 | 2020 | 4 | -0.774740 |
| 4 | 2020 | 5 | -0.799248 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))
# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)
# Add titles and labels
plt.title('Monthly Average Sentiment by Year from Negative Sentiment')
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']) # Month labels from 1 to 12
# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')
# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
# Build a 'YYYY-MM' key; only the month needs zero-padding
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
monthly_sentiment.head()
| | Year | Month | Average_Sentiment | Year_Month |
|---|---|---|---|---|
| 0 | 2020 | 1 | -0.849019 | 2020-01 |
| 1 | 2020 | 2 | -0.802590 | 2020-02 |
| 2 | 2020 | 3 | -0.739436 | 2020-03 |
| 3 | 2020 | 4 | -0.774740 | 2020-04 |
| 4 | 2020 | 5 | -0.799248 | 2020-05 |
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
# Use the negative-sentiment subset (news_ne) to match the plot title and the rest of this section
sns.lineplot(data=news_ne, x='month', y='vader_comp', marker='o')
# Customize the plot
plt.title('Monthly Average Sentiment from Negative Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
Day¶
daily_sentiment = news_ne.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment |
|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | -0.994350 |
| 1 | 2020 | 1 | 2 | -0.895345 |
| 2 | 2020 | 1 | 3 | -0.969167 |
| 3 | 2020 | 1 | 4 | -0.634900 |
| 4 | 2020 | 1 | 5 | -0.537550 |
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
daily_sentiment.head()
| | Year | Month | Day | Average_Sentiment | Month_Day |
|---|---|---|---|---|---|
| 0 | 2020 | 1 | 1 | -0.994350 | 01-01 |
| 1 | 2020 | 1 | 2 | -0.895345 | 01-02 |
| 2 | 2020 | 1 | 3 | -0.969167 | 01-03 |
| 3 | 2020 | 1 | 4 | -0.634900 | 01-04 |
| 4 | 2020 | 1 | 5 | -0.537550 | 01-05 |
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
# Set the style to white (no grid)
sns.set(style="white")
# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)
# Customize the plot
plt.title('Daily Average Sentiment Trend By Year from Negative Sentiment', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# Improve x-tick readability
# Show only the first day of each month or every few days
x_ticks = daily_sentiment['Month_Day'].unique()[::10] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')
# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)
# Show the plot
plt.show()
daily_sentiment2 = news_ne.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
daily_sentiment2.head()
| | Date | Average_Sentiment |
|---|---|---|
| 0 | 2020-01-01 | -0.994350 |
| 1 | 2020-01-02 | -0.895345 |
| 2 | 2020-01-03 | -0.969167 |
| 3 | 2020-01-04 | -0.634900 |
| 4 | 2020-01-05 | -0.537550 |
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')
# Customize the plot
plt.title('Daily Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30] # Adjust the step as needed
plt.xticks(x_ticks, rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
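Daily averages are noisy because some days contain only a handful of articles. One way to make the trend easier to read is a rolling mean over the daily series. The sketch below is illustrative only: `demo` is a made-up stand-in for `daily_sentiment2`, and the 3-day window is chosen to suit the toy data (a window of around 30 days would be a more natural starting point for the full series).

```python
import pandas as pd

# Synthetic stand-in for daily_sentiment2 (values are invented for illustration)
demo = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=10, freq='D'),
    'Average_Sentiment': [-0.9, -0.8, -1.0, -0.6, -0.5, -0.9, -0.7, -0.8, -0.95, -0.85],
})

# Rolling mean over the current and two preceding days;
# min_periods=1 keeps the first rows instead of producing NaN
demo['Rolling_Mean'] = demo['Average_Sentiment'].rolling(window=3, min_periods=1).mean()

print(demo)
```

The smoothed column can then be passed to `sns.lineplot` in place of `Average_Sentiment`.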
# Set the style
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_ne, x='day', y='vader_comp', marker='o')
# Customize the plot
plt.title('Daily Average Sentiment from Negative Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90) # Rotate x-ticks for better readability
# Show the plot
plt.show()
3-(B). Sentiment over time: Article Numbers¶
news.groupby('year')['vader_sent'].count()
year
2020     22836
2021     28962
2022     36775
2023    109491
Name: vader_sent, dtype: int64
grouped_data_po = news_po.groupby('year')['vader_sent'].size().reset_index(name = 'count')
grouped_data_po.head()
| | year | count |
|---|---|---|
| 0 | 2020 | 21413 |
| 1 | 2021 | 27591 |
| 2 | 2022 | 35325 |
| 3 | 2023 | 103232 |
# Set the style
sns.set(style="white")
# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_data_po, x='year', y='count')
# Customize the plot
plt.title('News Article Count (Yearly) from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Show the plot
plt.show()
grouped_data_ne = news_ne.groupby('year')['vader_sent'].size().reset_index(name = 'count')
grouped_data_ne.head()
| | year | count |
|---|---|---|
| 0 | 2020 | 1139 |
| 1 | 2021 | 1311 |
| 2 | 2022 | 1361 |
| 3 | 2023 | 6136 |
# Set the style
sns.set(style="white")
# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_data_ne, x='year', y='count')
# Customize the plot
plt.title('News Article Count (Yearly) from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Show the plot
plt.show()
# Create a pivot table
pivot_data = news.pivot_table(index='year', columns='vader_sent', aggfunc='size', fill_value=0)
pivot_data.head()
| year | negative | neutral | positive |
|---|---|---|---|
| 2020 | 1139 | 284 | 21413 |
| 2021 | 1311 | 60 | 27591 |
| 2022 | 1361 | 89 | 35325 |
| 2023 | 6136 | 123 | 103232 |
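Because the total number of articles grows sharply in 2023, raw counts make it hard to tell whether the sentiment mix itself changed. A minimal sketch of converting the counts to per-year shares follows. The `counts` frame re-enters the values from the table above so the example is self-contained; in the notebook, the same operation could be applied directly to `pivot_data`.

```python
import pandas as pd

# Yearly counts copied from the pivot table above
counts = pd.DataFrame(
    {
        'negative': [1139, 1311, 1361, 6136],
        'neutral': [284, 60, 89, 123],
        'positive': [21413, 27591, 35325, 103232],
    },
    index=pd.Index([2020, 2021, 2022, 2023], name='year'),
)

# Divide each row by its row total so every year sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(4))
```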
sns.set(style="white")
# Create a line plot
plt.figure(figsize=(10, 5))
sns.lineplot(data=pivot_data, markers=True, dashes=False)
# Customize the plot
plt.title('Yearly News Article Count by Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Place the legend outside of the plot to the right
plt.legend(title='Sentiment', loc='upper left', bbox_to_anchor=(1.01, 1.02))
# Adjust subplot parameters to fit the legend
plt.subplots_adjust(right=0.75)
# Show the plot
plt.show()
# Combine year and month into a single column
news['year_month'] = news['year'].astype(str) + '-' + news['month'].astype(str).str.zfill(2)
# Group by year_month and vader_sent, and count the occurrences
grouped_data = news.groupby(['year_month', 'vader_sent']).size().reset_index(name='count')
# Set the style
sns.set(style="white")
# Define the sentiments
sentiments = ['positive', 'negative', 'neutral']
# Create separate plots for each sentiment
for sentiment in sentiments:
    # Filter data for the current sentiment
    data_filtered = grouped_data[grouped_data['vader_sent'] == sentiment]
    # Create a bar plot for the current sentiment
    plt.figure(figsize=(10, 5))
    barplot = sns.barplot(data=data_filtered, x='year_month', y='count')
    # Customize the plot
    plt.title(f'Monthly Article Count ({sentiment.capitalize()} Sentiment)', fontsize=16)
    plt.xlabel('Year-Month', fontsize=14)
    plt.ylabel('Article Count', fontsize=14)
    # Rotate and skim x-ticks
    xtick_labels = barplot.get_xticklabels()
    skim_factor = 5 # Adjust this value as needed to skip x-ticks
    barplot.set_xticklabels([label if i % skim_factor == 0 else '' for i, label in enumerate(xtick_labels)], rotation=90)
    # Place the legend outside of the plot
    # plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')
    # Show the plot
    plt.show()
# Pivot the data for stacked bar plot
pivot_data = grouped_data.pivot(index='year_month', columns='vader_sent', values='count').fillna(0)
# Extend the custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # Add more colors as needed
# Create a stacked bar plot. DataFrame.plot opens its own figure, so the size is
# passed via figsize; a separate plt.figure() call would only leave an empty figure behind
ax = pivot_data.plot(kind='bar', stacked=True, color=custom_colors, figsize=(20, 10))
# Customize the plot
plt.title('Monthly Total Article Count with Sentiment Portions', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Total Article Count', fontsize=14)
# Rotate and skim x-ticks
plt.xticks(rotation=90)
xtick_labels = ax.get_xticklabels()
skim_factor = 5 # Adjust this value as needed
ax.set_xticklabels([label if i % skim_factor == 0 else '' for i, label in enumerate(xtick_labels)])
# Place the legend outside of the plot
plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')
# Show the plot
plt.show()
4. Word Count¶
4.1. Original Data¶
# Set the style
sns.set(style="white")
# Create a box plot
plt.figure(figsize=(18, 8))
sns.boxplot(data=news, x='vader_sent', y='text_word_count')
# Customize the plot
plt.title('Word Count Distribution by Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Word Count', fontsize=14)
# Show the plot
plt.show()
# Create a violin plot
plt.figure(figsize=(12, 8))
sns.violinplot(data=news, x='vader_sent', y='text_word_count')
# Customize the plot
plt.title('Word Count Distribution by Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Word Count', fontsize=14)
# Show the plot
plt.show()
# Assuming news is your DataFrame and it contains 'vader_sent' and 'text_word_count' columns
plt.figure(figsize=(10, 6))
# Map each sentiment category to a fixed color so the pairing does not depend on row order
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
for sentiment, color in colors.items():
    data = news[news['vader_sent'] == sentiment] # Filter data for each sentiment category
    sns.histplot(data=data, x='text_word_count', label=sentiment, color=color, bins=30, stat='density', element='step')
plt.xlabel('Text Word Count')
plt.ylabel('Density')
plt.title('Distribution of Text Word Count by Sentiment')
plt.legend(title='Sentiment')
plt.show()
4.2. Data without Outliers¶
plt.figure(figsize=(12, 8))
sns.boxplot(data=news, x='vader_sent', y='text_word_count', showfliers=False)
plt.title('Word Count Distribution by Sentiment (Without Outliers)')
plt.xlabel('Sentiment')
plt.ylabel('Word Count')
plt.show()
plt.figure(figsize=(12, 8))
sns.violinplot(data=news, x='vader_sent', y='text_word_count', cut=0)
plt.title('Word Count Distribution by Sentiment (Violin Plot)')
plt.xlabel('Sentiment')
plt.ylabel('Word Count')
plt.show()
news[news['vader_sent'] == 'positive']['text_word_count'].describe()
count    187561.000000
mean        811.295456
std         611.917261
min           4.000000
25%         488.000000
50%         672.000000
75%         984.000000
max       29325.000000
Name: text_word_count, dtype: float64
news[news['vader_sent'] == 'negative']['text_word_count'].describe()
count     9947.000000
mean       754.631447
std        528.337370
min          5.000000
25%        421.000000
50%        628.000000
75%        976.000000
max      10490.000000
Name: text_word_count, dtype: float64
news[news['vader_sent'] == 'neutral']['text_word_count'].describe()
count     556.000000
mean       57.557554
std       166.231279
min         3.000000
25%        10.000000
50%        12.000000
75%        16.000000
max      1402.000000
Name: text_word_count, dtype: float64
def calculate_outlier_thresholds(series):
    # Tukey's rule: values beyond 1.5 * IQR from the quartiles count as outliers
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound
# Calculate thresholds for each sentiment category
positive_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'positive']['text_word_count'])
negative_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'negative']['text_word_count'])
neutral_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'neutral']['text_word_count'])
print(positive_thresholds)
print(negative_thresholds)
print(neutral_thresholds)
(-256.0, 1728.0)
(-411.5, 1808.5)
(1.0, 25.0)
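The thresholds above differ considerably between categories (roughly 1,700 to 1,800 words for positive and negative, but only 25 for neutral), so a single global cutoff treats the groups unevenly. As an alternative, the same IQR rule could be applied within each sentiment group. This is a sketch under stated assumptions: `demo` is invented toy data, and `iqr_bounds` and `within_iqr` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series):
    # Same 1.5 * IQR rule as calculate_outlier_thresholds above
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def within_iqr(series):
    # Boolean flag per row: True when the value lies inside its group's bounds
    lower, upper = iqr_bounds(series)
    return series.between(lower, upper)

# Toy data (invented): one extreme word count in each group
demo = pd.DataFrame({
    'vader_sent': ['positive'] * 6 + ['neutral'] * 6,
    'text_word_count': [400, 500, 600, 700, 800, 30000, 10, 11, 12, 13, 14, 1400],
})

# Evaluate the rule separately for each sentiment category
mask = demo.groupby('vader_sent')['text_word_count'].transform(within_iqr).astype(bool)
filtered = demo[mask]

print(filtered)
```

With per-group bounds, the long neutral article is dropped relative to other neutral articles, which a global cutoff such as 2,000 words would not do.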
# Set the style
sns.set(style="white")
# Define the figure size
plt.figure(figsize=(10, 6))
# Filter out text_word_count values exceeding 2000
filtered_news = news[news['text_word_count'] <= 2000]
# Map each sentiment category to a fixed color so the pairing does not depend on row order
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
# Plot overlapping histograms for each sentiment category
for sentiment, color in colors.items():
    # Filter data for each sentiment category
    data = filtered_news[filtered_news['vader_sent'] == sentiment]['text_word_count']
    sns.histplot(data, label=sentiment, color=color, element='step', stat='count', common_norm=False, binwidth=50)
# Customize the plot
plt.xlabel('Text Word Count')
plt.ylabel('Article Count')
plt.title('Distribution of Text Word Count by Sentiment (Word Count ≤ 2000)')
# Place the legend outside the plot
plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')
# Show the plot
plt.tight_layout() # Adjust the layout
plt.show()
5. Word Cloud¶
from wordcloud import WordCloud
# Function to generate word cloud
def generate_wordcloud(text, title):
    wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(text)
    plt.figure(figsize = (10, 5), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.title(title, fontsize=20)
    plt.show()
# Replace 'positive' with 'negative' or 'neutral' as needed
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'positive']['text_lemm'])
# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Positive Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'negative']['text_lemm'])
# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Negative Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'neutral']['text_lemm'])
# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Neutral Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()
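Word clouds convey relative frequency visually but make exact comparisons difficult. A simple numeric complement is a top-N term count with `collections.Counter`. The sketch below uses invented strings in place of the `text_lemm` column and assumes whitespace-separated lemmas.

```python
from collections import Counter

# Toy stand-in for lemmatized article texts (invented for illustration)
docs = [
    "market rise investor gain",
    "market fall investor loss",
    "market rally",
]

# Count every whitespace-separated token across all documents
term_counts = Counter(token for doc in docs for token in doc.split())

# The two most frequent terms with their counts
top_terms = term_counts.most_common(2)
print(top_terms)
```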